In [9]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

PCA

Earlier we explored an example of using a PCA projection as a feature selector for facial recognition with a support vector machine.
Here we will take a look back and explore a bit more of what went into that. Recall that we were using the Labeled Faces in the Wild dataset made available through Scikit-Learn:


In [10]:
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.target_names)
print(faces.images.shape)


['Ariel Sharon' 'Colin Powell' 'Donald Rumsfeld' 'George W Bush'
 'Gerhard Schroeder' 'Hugo Chavez' 'Junichiro Koizumi' 'Tony Blair']
(1348, 62, 47)

As in the lecture, we will apply a PCA dimensionality reduction to 150 dimensions:


In [11]:
from sklearn.decomposition import PCA
pca = PCA(150)
pca.fit(faces.data)


Out[11]:
PCA(copy=True, iterated_power='auto', n_components=150, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [12]:
pca.components_.shape


Out[12]:
(150, 2914)

Take a closer look at the shape of the components array. What factors determine that shape? Pause for a while and think about it carefully.

If you think you have discovered the correct answer, execute the next cell.


In [ ]:
%load PCA_solution.py
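For reference, here is a minimal check of the idea (the actual PCA_solution.py may phrase it differently): components_ has one row per principal component and one column per input feature, and each 62 × 47 face image contributes 62 × 47 = 2914 pixel features.

# Each row of components_ is one principal component; each column is one pixel feature.
n_components, n_features = pca.components_.shape
print(n_components, n_features)                        # 150 2914
print(faces.images.shape[1] * faces.images.shape[2])   # 62 * 47 = 2914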

Since the principal components have the same size as our feature vectors, I want you to interpret each component as an image and plot it. That is, plot the first 24 components similarly to what we did in the SVM lecture. Of course, there are no names to plot this time.


In [ ]:
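One possible approach, as a sketch (the 3 × 8 grid and the 'bone' colormap are just one choice):

# Interpret each principal component as a 62x47 image ("eigenface") and plot the first 24.
fig, axes = plt.subplots(3, 8, figsize=(9, 4),
                         subplot_kw={'xticks': [], 'yticks': []},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
    ax.imshow(pca.components_[i].reshape(62, 47), cmap='bone')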

Have a close look at the faces and think about how to interpret these plots. Develop a hypothesis that explains why these images look the way they do and why they are useful for the task at hand. We will discuss this in the next lecture.

During the lecture we talked about the amount of variance each component explains. Plot the cumulative sum of the explained variance.
Hint 1: The cumulative sum of [1, 2, 3] is [1, 3, 6].
Hint 2: NumPy's cumsum may come in handy.
Hint 3: Don't use pca.explained_variance_. Instead use pca.explained_variance_ratio_, whose values sum to 1.


In [ ]:
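A sketch of one way to do it (the axis labels are optional):

# Cumulative fraction of the total variance explained by the first k components.
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')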

If you accomplished the previous task, you may have come to the conclusion that these 150 components account for nearly 95% of the variance. So what happens if we reconstruct the images from those 150 components?
Compare the input images (say, the first 10 of our dataset) with the images reconstructed from these 150 components by plotting them side by side.
Hint: You have to fit the model, transform the data, and then apply the inverse transform that we saw already.


In [ ]:
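A sketch of one possible solution (the 2 × 10 layout and the 'binary_r' colormap are just one choice):

# Project the faces onto the 150 components, then map back to pixel space.
pca = PCA(150).fit(faces.data)
components = pca.transform(faces.data)
projected = pca.inverse_transform(components)

# Top row: original images; bottom row: reconstructions from 150 components.
fig, ax = plt.subplots(2, 10, figsize=(10, 2.5),
                       subplot_kw={'xticks': [], 'yticks': []},
                       gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i in range(10):
    ax[0, i].imshow(faces.data[i].reshape(62, 47), cmap='binary_r')
    ax[1, i].imshow(projected[i].reshape(62, 47), cmap='binary_r')
ax[0, 0].set_ylabel('original')
ax[1, 0].set_ylabel('150-dim\nreconstruction')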